[SOUND].
This lecture is about the vector
space retrieval model.
We're going to give
an introduction to its basic idea.
In the last lecture we talked about
the different ways of designing
a retrieval model which would give
us a different the ranking function.
In this lecture, we're going to
talk about the, the specific way of
design the ramping function called
a vector space mutual model.
And we're going to give a brief
introduction to the basic idea.
Vector space model is a special case of
similarity based models
as we discussed before.
Which means,
we assume relevance is roughly
similarity between a document and a query.
Now whether this assumption is true,
is actually a question.
But in order to solve our
search problem we have to
convert the vague notion of
relevance into a more precise
definition that can be implemented
with the programming language.
So in this process we have to
make a number of assumptions.
This is the first assumption
that we make here.
Basically we assume that
if a document is more
similar to a query than another document,
then the first document would be assumed
to be more relevant than the second one.
And this is the basis for
ranking documents in this approach.
Again, it's questionable whether this is
really the best definition for relevance.
As we will see later there
are other ways to model relevance.
The first idea of vector space retrieval
model is actually very easy to understand.
Imagine a high dimensional space, where
each dimension corresponds to a term.
So, here, I show a three
dimensional space with three words,
programming, library, and presidential.
So each term, here, defines one dimension.
Now we can consider vectors in
this three dimensional space.
And we're going to assume
all our documents and
the query will be placed
in this vector space.
So, for example, one document that might
be represented at by this vector, d1.
Now this means this document probably
covers library and presidential.
But it doesn't really
talk about programming.
All right, what does this mean in
terms of presentation of document?
That just means,
we're going to look at our document
from the perspective of this vector.
We're going to ignore everything else.
Basically what we see here is
only the vector of the document.
Of course the document
has other information.
For example,
the orders of words are simply ignored and
that's because we're
assume that the words.
So with this representation
you have already seen, d1,
seems to suggest a topic in
either presidential library.
Now this is different
from another document.
Which might be represented as
a different vector, d2 here.
Now in this case, the document that
covers programming and library, but
it doesn't talk about presidential.
So what does this remind you?
Well, you can probably guess, the topic
is likely about program language and
the library is software library, library.
So this shows that by using this
vector space representation,
we can actually capture the differences
between topics of documents.
Now you can also imagine
there are other vectors.
For example,
d3 is pointing in that direction, that
might be about presidential programming.
And in fact we're going to place all
the documents in this vector space.
And they will be pointing
to all kinds of directions.
And similarly, we're going to place
our query also in this space,
as another vector.
And then we're going to measure the
similarity between the query vector and
every document vector.
So, in this case for example, we can
easily see d2 seems to be the closest of,
to this query factor and
therefore d2 will be ranked above others.
So this was a, basically the main
idea of the, the vector space model.
So to be more pri,
precise, be more precise.
Vector space model is a framework.
In this framework,
we make the following assumptions.
First, we represent a document and
query by a term vector.
So here a term can be any basic concept.
For example, a word or a phrase,
or even enneagram of characters.
Those are a sequence of
characters inside a word.
Each term is assumed to
define one dimension.
Therefore N terms.
In our vocabulary,
we define N-dimensional space.
A query vector would consist
of a number of elements
corresponding to the weights
of different terms.
Each document vector is also similar.
It has a number of elements and
each value of each element
is indicating that weight
of the corresponding term.
Here you can see,
we have seen there are N dimensions.
Therefore, there are N elements,
each corresponding to the weight
on the particular term.
So the relevance in this case would
be assume to be the similarity
between the two vectors,
therefore our range in function is
also defined as the similarity between
the query vector and document vector.
Now, if I ask you to write the program
to the internet this approach
in the search engine.
You would realize that this
was far from clear, right?
We haven't seen a lot of things in detail
therefore it's impossible to actually
write the program to implement this.
That's why I said this is a framework.
And this has to be refined
in order to actually
suggest a particular function,
that you can implement on the computer.
So, what does this framework not serve?
Well, it actually hasn't set many things
that would be required in order
to implement this function.
First, it did not say how we should define
or select the basic concepts exactly.
We clearly assume
the concepts are orthogonal,
otherwise there will be redundancy.
For example, if two synonyms are somehow
distinguished as two different concepts.
Then they would be defined
in two different dimensions.
And then that would clearly
cause a redundancy here.
Or overemphasizing of
matching this concept.
Because it would be as if you
matched the two dimensions
when you actually matched
one semantic concept.
Secondly, it did not say how we
exactly should place documents and
query in this space.
Basically I show you some examples
of query and document vectors.
But where exactly should the vector for
a particular document point to?
[INAUDIBLE] So this is equivalent
to how to define the term weights.
How do you computer use element
values in those vectors?
This is a very important
question because term weight
in the query vector indicates
the importance of term.
So depending on how you assign the weight,
you might prefer some terms
to be matched over others.
Similarly, term weight in
the document is also very meaningful.
It indicates how well the term
characterizes the document.
If you got it wrong, then you clearly
don't represent this document accurately.
Finally, how we define the similarity
measure is also not clear.
So these questions must be addressed
before we can have an operational
function that we can actually
implement using a program language.
So how do we solve these problems
is the main topic of the next lecture.
[MUSIC]

